Repeat After Me: Transformers are Better than State Space Models at Copying

llm
research paper
Author

Santosh Sawant

Published

February 5, 2024

Transformers are the workhorse of modern sequence modeling, achieving remarkable performance on a variety of tasks, but they have an unavoidable inefficiency: the memory and compute required to predict the next token grow with the length of the sequence. Recently, there has been growing interest in models that use a fixed-size latent state that does not depend on the sequence length, which the paper refers to as “generalized state space models” (GSSMs).
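To make that contrast concrete, here is a minimal sketch (plain NumPy with made-up dimensions, not code from the paper) of why decoding cost differs: a transformer-style decoder appends to a key/value cache that grows with every token, while a GSSM updates a latent state whose size never changes.

```python
import numpy as np

d_model, seq_len = 64, 1024

# Transformer-style decoding: the KV cache grows with every generated token,
# so memory per step is O(t * d_model).
kv_cache = []
for t in range(seq_len):
    k = np.random.randn(d_model)  # stand-in for the key of token t
    v = np.random.randn(d_model)  # stand-in for the value of token t
    kv_cache.append((k, v))       # cache now holds t + 1 entries

# GSSM-style decoding: a single fixed-size latent state is updated in place,
# so memory per step is O(d_state) regardless of how many tokens were seen.
d_state = 128
state = np.zeros(d_state)
A = np.random.randn(d_state, d_state) * 0.01  # stand-in transition matrix
B = np.random.randn(d_state, d_model) * 0.01  # stand-in input projection
for t in range(seq_len):
    x = np.random.randn(d_model)  # stand-in embedding of token t
    state = A @ state + B @ x     # state size never changes

print(f"transformer cache entries: {len(kv_cache)}, GSSM state size: {state.shape[0]}")
```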

GSSMs have demonstrated impressive performance, but it has not been clear what, if anything, these models sacrifice for their improved efficiency. That is exactly what this paper sets out to answer: GSSMs are promising in terms of inference-time efficiency, but they are limited compared to transformer models on tasks that require copying from the input context.

To understand this gap in capabilities, the authors study the simple task of string copying. They show that a transformer can copy strings of length exponential in its number of heads, while a GSSM is fundamentally limited by its fixed-size latent state, and they confirm the gap empirically on synthetic copying tasks and with pretrained models. A toy version of such a copying benchmark is sketched below.
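Here is a toy copy-task harness in Python as a concrete illustration of that kind of evaluation. It is only a sketch: `make_copy_example`, `copy_accuracy`, and `generate_fn` are hypothetical names, and the trivial lambda at the end stands in for a real transformer or GSSM.

```python
import random
import string

def make_copy_example(length, vocab=string.ascii_lowercase):
    """Build one synthetic copying example: the model sees the string plus a
    separator and must reproduce the string verbatim."""
    s = "".join(random.choice(vocab) for _ in range(length))
    return f"{s}|", s  # (prompt, expected completion)

def copy_accuracy(generate_fn, lengths, n_per_length=50):
    """Exact-match accuracy of `generate_fn` (any prompt -> completion callable)
    on random strings of each length."""
    results = {}
    for L in lengths:
        correct = 0
        for _ in range(n_per_length):
            prompt, target = make_copy_example(L)
            correct += generate_fn(prompt) == target
        results[L] = correct / n_per_length
    return results

# Example with a trivial "oracle" copier standing in for a real model:
print(copy_accuracy(lambda p: p[:-1], lengths=[16, 64, 256]))
```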

Overall, transformers are better than GSSMs at copying from their input context. However, GSSMs have many advantages over transformers when it comes to memory and computational complexity, as well as generating long, consistent text. Future work should focus on building hybrid architectures that endow state space models with an attention-like mechanism, allowing them to retrieve relevant pieces of text from their input. What do you think?

So why does the size of the input context matter so much in LLMs? To understand this, let's look at the GPU level. Modern GPUs have a “problem”: they are too fast. GPUs have become so fast at performing calculations that the speed of computation (FLOPS) far exceeds the memory bandwidth (GB/s), i.e. the speed of data transfer between memory areas. For example, an NVIDIA A100 can perform 19.5 TFLOPS of FP32 compute while having a memory bandwidth of about 2 TB/s; counting 4 bytes per 32-bit operand, memory delivers values roughly 40 times slower than the compute units can consume them.
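A quick back-of-the-envelope check of those numbers (rounded public A100 specs, not a measurement):

```python
# Rough check of the ~40x gap between compute throughput and memory bandwidth.
flops_per_s = 19.5e12        # A100 FP32 throughput
bytes_per_s = 2.0e12         # HBM bandwidth (~2 TB/s)
bytes_per_operand = 4        # 32-bit values

operands_per_s = bytes_per_s / bytes_per_operand   # ~0.5e12 values/s
ratio = flops_per_s / operands_per_s               # ~39x

print(f"compute can consume operands ~{ratio:.0f}x faster than memory can deliver them")
```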

This means that the bottleneck is often not how many operations we perform, but how much data those operations need to move, which depends on the size and number of the tensors involved. For example, computing the same operation on the same tensor N times may be faster than computing it on N different tensors of the same size, because in the second case the GPU has to move far more data between memory and the compute units. That is exactly what happens during memory-intensive tasks such as copying long strings, retrieval, and few-shot question answering. A rough micro-benchmark sketch of this effect follows.
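The sketch below illustrates the idea with PyTorch (assuming it is installed). The exact timings depend heavily on the hardware and on cache sizes, so treat it as illustrative rather than a rigorous benchmark: both loops do the same number of floating-point operations, but the second has to stream N distinct tensors from memory.

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
N, size = 100, (2048, 2048)  # ~16 MB per float32 tensor

one = torch.randn(size, device=device)
many = [torch.randn(size, device=device) for _ in range(N)]

def timed(fn):
    """Time a callable, synchronizing around it when running on GPU."""
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - t0

# Same FLOP count in both cases; only the amount of data that must be
# streamed from memory differs.
t_same = timed(lambda: [one.mul_(1.0) for _ in range(N)])   # one tensor, may stay in cache
t_diff = timed(lambda: [x.mul_(1.0) for x in many])         # N distinct tensors to stream

print(f"same tensor: {t_same:.3f}s  different tensors: {t_diff:.3f}s")
```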

So the goal should not only be to minimize the number of operations we perform, but also to minimize the memory accesses and transfers those operations require.